A Corpus for Modeling Word Importance in Spoken Dialogue Transcripts

Authors

  • Sushant Kafle
  • Matt Huenerfauth
Abstract

Motivated by a project to build a system for people who are deaf or hard-of-hearing (DHH) that would use automatic speech recognition (ASR) to produce real-time text captions of spoken English during in-person meetings with hearing individuals, we have augmented a transcript of the Switchboard conversational dialogue corpus with an overlay of word-importance annotations: a numeric score for each word that indicates its importance to the meaning of its dialogue turn. Further, we demonstrate the utility of this corpus by training an automatic word-importance labeling model; our best-performing model achieves an F-score of 0.60 on an ordinal 6-class word-importance classification task, with an agreement (concordance correlation coefficient) of 0.839 with the human annotators (the agreement score between annotators is 0.89). Finally, we discuss our intended future applications of this resource, particularly for evaluating ASR performance, i.e., creating metrics that predict the usability of ASR-output caption text for DHH users better than Word Error Rate (WER).
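The agreement figures in the abstract are concordance correlation coefficients. As a rough illustration only, the sketch below computes the standard (Lin's) definition of that coefficient for two lists of scores. It assumes population (1/N) variance and covariance; the abstract does not say how the authors computed their values, and the function name and sample scores are hypothetical.

```python
from statistics import fmean


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for two equal-length score lists.

    Assumption: population (1/N) variance and covariance are used.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_x, mean_y = fmean(x), fmean(y)
    var_x = fmean([(a - mean_x) ** 2 for a in x])
    var_y = fmean([(b - mean_y) ** 2 for b in y])
    cov_xy = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])

    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        # Both inputs are constant and identical; the coefficient is undefined.
        raise ValueError("coefficient is undefined for identical constant inputs")
    return 2 * cov_xy / denominator


# Hypothetical importance scores from two raters for the same five words.
rater_a = [0.1, 0.8, 0.3, 0.9, 0.5]
rater_b = [0.2, 0.7, 0.3, 1.0, 0.4]
print(concordance_ccc(rater_a, rater_b))
```

Unlike Pearson correlation, this coefficient penalizes a systematic offset between the two sets of scores: two raters whose scores move together but differ by a constant receive a value below 1.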

Similar resources

Automatic lexicon generation and dialogue modeling for spontaneous speech

This paper describes a novel framework for dialogue modeling based on a superword model, a superset of the word n-gram. This has a remarkable advantage, because only transcribed text is needed to obtain the model, and no word dictionary is needed. In this paper, it is shown that the expressions specific to dialogue speech are extracted automatically from the transcriptions of spoken dialogue corpora ...

Uncertainty Corpus: Resource to Study User Affect in Complex Spoken Dialogue Systems

We present a corpus of spoken dialogues between students and an adaptive Wizard-of-Oz tutoring system, in which student uncertainty was manually annotated in real-time. We detail the corpus contents, including speech files, transcripts, annotations, and log files, and we discuss possible future uses by the computational linguistics community as a novel resource for studying naturally occurring ...

Unsupervised Alignment for Segmental-based Language Understanding

The most efficient recent approaches to language understanding are statistical. These approaches benefit from a segmental semantic annotation of corpora. To reduce the production cost of such corpora, this paper proposes a method that is able to match first identified concepts with word sequences in an unsupervised way. This method, based on automatic alignment, is used by an understanding sy...

High Frequency Word Entrainment in Spoken Dialogue

Cognitive theories of dialogue hold that entrainment, the automatic alignment between dialogue partners at many levels of linguistic representation, is key to facilitating both production and comprehension in dialogue. In this paper we examine novel types of entrainment in two corpora—Switchboard and the Columbia Games corpus. We examine entrainment in use of high-frequency words (the most comm...

Classification of discourse functions of affirmative words in spoken dialogue

We present results of a series of machine learning experiments that address the classification of the discourse function of single affirmative cue words such as alright, okay and mm-hm in a spoken dialogue corpus. We suggest that a simple discourse/sentential distinction is not sufficient for such words and propose two additional classification sub-tasks: identifying (a) whether such words conv...

Journal title:
  • CoRR

Volume abs/1801.09746

Publication date: 2018